
fix(tuning): cap LightGBM search space for thin datasets to prevent memorisation#321

Open
drussellmrichie wants to merge 1 commit into larsiusprime:master from drussellmrichie:fix/tuning-thin-dataset-search-space

Conversation

@drussellmrichie

Problem

When a model group has very few training samples (e.g. <200), the Optuna tuner can select num_leaves values in the thousands. With ~101 training samples and num_leaves=1514, LightGBM effectively memorises the training folds — CV MAPE looks fine because the model can overfit each fold's tiny training set, but out-of-sample performance collapses.

Concrete example from Philadelphia AVM work: residential_mf_large has ~101 training sales. The tuner found num_leaves=1514, which degraded the ratio-study COD from ~40 to ~55 (IAAO standard ≤ 20 for residential). Manually fixing num_leaves=15 recovered COD to ~40.

The same issue can affect min_data_in_leaf: with a range of [20, 500], the sampled value can exceed the entire training-fold size, leaving LightGBM no legal split at all.

Fix

Before constructing the Optuna search space, compute the approximate training-fold size and derive the caps from it:

```python
n_train_per_fold = int(len(X) * (n_splits - 1) / n_splits)
max_num_leaves = max(8, min(2048, n_train_per_fold // 4))
max_min_data_in_leaf = max(2, min(500, n_train_per_fold // 4))
```

The `// 4` rule means each leaf covers at least ~4 samples on average, a conservative but reasonable floor for a regression tree. Both upper bounds are clamped to the original maximums (2048 and 500), so large datasets are unaffected.

A diagnostic print is emitted under verbose=True when the cap takes effect.
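A minimal self-contained sketch of this logic (the helper name and the exact wording of the diagnostic are illustrative, not the PR's actual API):

```python
def capped_search_space(n_samples: int, n_splits: int = 5, verbose: bool = False):
    """Derive dataset-scaled upper bounds for the LightGBM search space."""
    # Approximate size of one training fold under k-fold CV.
    n_train_per_fold = int(n_samples * (n_splits - 1) / n_splits)
    # Each leaf should cover >= ~4 samples on average; clamp to original maxima.
    max_num_leaves = max(8, min(2048, n_train_per_fold // 4))
    max_min_data_in_leaf = max(2, min(500, n_train_per_fold // 4))
    if verbose and max_num_leaves < 2048:
        print(f"search-space cap active: max_num_leaves={max_num_leaves} "
              f"(~{n_train_per_fold} samples per training fold)")
    return max_num_leaves, max_min_data_in_leaf

# Inside an Optuna objective, these bounds would replace the fixed ones, e.g.:
#   num_leaves = trial.suggest_int("num_leaves", 8, max_num_leaves)
#   min_data_in_leaf = trial.suggest_int("min_data_in_leaf", 2, max_min_data_in_leaf)
```

For the thin group from the example above (n = 101, k = 5) this yields max_num_leaves = 20, well below the num_leaves = 1514 the uncapped tuner selected.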

Behaviour at key dataset sizes

| n (total) | n_train_per_fold (k=5) | max_num_leaves | max_min_data_in_leaf |
| --- | --- | --- | --- |
| 101 | 80 | 20 | 20 |
| 300 | 240 | 60 | 60 |
| 500 | 400 | 100 | 100 |
| 2000 | 1600 | 400 | 400 |
| 10000 | 8000 | 2000 | 500 |
| ≥ 10240 | ≥ 8192 | 2048 (original ceiling) | 500 |

The min_data_in_leaf cap reaches the original 500 ceiling at n ≥ 2500 (n_train_per_fold ≥ 2000). The num_leaves cap only reaches the original 2048 ceiling at n ≥ ~10240 (n_train_per_fold ≥ 8192); below that it is strictly tighter than before.
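Those saturation points can be double-checked with a brute-force scan over n using the same formula (a standalone sketch, not code from the PR):

```python
def caps(n: int, k: int = 5) -> tuple[int, int]:
    """Dataset-scaled upper bounds for num_leaves and min_data_in_leaf."""
    n_train = int(n * (k - 1) / k)
    bound = n_train // 4
    return max(8, min(2048, bound)), max(2, min(500, bound))

# Smallest n at which each cap first reaches its original ceiling (k = 5).
n_leaves_ceiling = next(n for n in range(1, 20_000) if caps(n)[0] == 2048)
n_min_data_ceiling = next(n for n in range(1, 20_000) if caps(n)[1] == 500)
print(n_leaves_ceiling, n_min_data_ceiling)  # → 10240 2500
```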

Test plan

  • Verify existing tests pass (no breakage on normal-sized datasets)
  • Smoke-test with a thin group (n < 200): confirm num_leaves in saved params is ≤ n // 5
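The n // 5 bound in the smoke test is exact for k = 5, since floor(floor(4n/5) / 4) = floor(n / 5) by the nested floor-division identity. A hypothetical check of that identity against the cap (the function name is illustrative):

```python
def max_num_leaves_cap(n: int, k: int = 5) -> int:
    """Upper bound on num_leaves for a dataset of n samples under k-fold CV."""
    n_train_per_fold = int(n * (k - 1) / k)
    return max(8, min(2048, n_train_per_fold // 4))

# For thin groups (8 <= n // 5 <= 2048) the cap equals n // 5 exactly,
# so any saved num_leaves must satisfy num_leaves <= n // 5.
for n in (101, 150, 199):
    assert max_num_leaves_cap(n) == n // 5
```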

🤖 Generated with Claude Code

…r thin datasets

With small model groups (e.g. <200 training samples), the Optuna tuner can select
num_leaves values in the thousands — severe memorisation that yields artificially low
CV MAPE but collapses out-of-sample performance. For example, with ~101 training
samples the tuner found num_leaves=1514, degrading ratio-study COD from ~40 to ~55.

Fix: before building the search space, compute n_train_per_fold ≈ n * (k-1)/k and
cap num_leaves at max(8, n_train_per_fold // 4) and min_data_in_leaf at
max(2, n_train_per_fold // 4). For large datasets (n ≥ ~10240 at k=5, where
n_train_per_fold // 4 ≥ 2048) the caps coincide with the original upper bounds and
have no effect. For thin datasets the caps prevent the tuner from selecting tree
complexities that cannot generalise.

A verbose warning is printed when the cap takes effect.
@github-actions
Contributor

Thank you for your contribution.
Please sign our CLA at the following link:
Click here to sign the CLA.
A maintainer will verify your signature and confirm it here by commenting with the following sentence:


I affirm that this contributor has signed the CLA


Russell Richie seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

drussellmrichie pushed a commit to drussellmrichie/openavmkit that referenced this pull request Apr 15, 2026
Commits all patches from C:\projects\philly_open_avmkit\patches\ as a
persistent local commit so they survive branch switches. Previously these
patches existed only as working-directory edits and were lost when the
fix/tuning-thin-dataset-search-space PR branch was checked out cleanly.

Patches applied:
- benchmark.py: _SMRContribContext/_DS classes; do_contributions threading;
  _write_model_results slim pkl + _model_features.json sidecar; per-group
  ind_vars override via group_overrides
- data.py: cKDTree import fix (scipy >=1.14); astype(int) cast after .loc
- modeling.py: positional reset_index+concat to avoid 421k^2 cartesian join
- pipeline.py: finalize_models run_* params; compute_model_contributions();
  two-checkpoint SHAP resume flow
- sales_scrutiny_study.py: astype(str) on model_group before concatenation
- shap_analysis.py: numpy array truth-value fix; missing-feature warning+filter
- tuning.py: thin-dataset guard + dataset-scaled search-space caps (also in
  fbcad5a as upstream PR larsiusprime#321)
- utilities/cache.py: ArrowExtensionArray .sum() fix via .eq() + int()
- utilities/stats.py: median-impute NaN before sklearn/statsmodels fits
